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Abstract 

Overcomplete representations and dictionary learning algorithms kept at- 
tracting a growing interest in the machine learning community. This paper 
addresses the emerging problem of comparing multivariate overcomplete 
representations. Despite a recurrent need to rely on a distance for learn- 
ing or assessing multivariate overcomplete representations, no metrics in 
their underlying spaces have yet been proposed. Henceforth we propose 
to study overcomplete representations from the perspective of frame the- 
ory and matrix manifolds. We consider distances between multivariate 
dictionaries as distances between their spans which reveal to be elements 
of a Grassmannian manifold. We introduce Wasscrstcin-likc set-metrics 
defined on Grassmannian spaces and study their properties both theo- 
retically and numerically. Indeed a deep experimental study based on 
tailored synthetic datasctsand real EEG signals for Brain-Computer In- 
terfaces (BCI) have been conducted. In particular, the introduced metrics 
have been embedded in clustering algorithm and applied to BCI Compe- 
tition IV-2a for dataset quality assessment. Besides, a principled connec- 
tion is made between three close but still disjoint research fields, namely, 
Grassmannian packing, dictionary learning and compressed sensing. 
Keywords: Dictionary Learning, Metrics, Frames, Grassmannian Man- 
ifolds, Packing, Compressed Sensing, Multivariate Dataset. 



1 Introduction 



A classical question in machine learning is how to choose a good feature 
space to analyze the input data. One possible and elegant response is to 
learn this feature space over mild hypotheses. This approach is known as 
dictionary learning and follows the dictionary-based framework which have 
provided several important results in the last decades (Meyer, 1995; Mal- 
lat, 1999; Ti^opp, 2004; Candes et al., 2006; Gribonval and Schnass, 2010; 
Jenatton et al., 2012). In dictionary-based approaches, an expert selects a 
specific family of basis functions, called atoms, known to capture important 
features of the input data, for example wavelets (Mallat, 1999) or curvelets 
(Candes and Donoho, 2004). When no specific expert knowledge is avail- 
able, dictionary learning algorithms could learn those atoms from a given 
dataset. The problem is then stated as an optimization procedure usually 
under some sparsity constraints. Thanks to the pioneering works on sparsity 
constraint decomposition of Mallat and Zhang (1993), on the Lasso prob- 
lem (Tibshirani, 1996) and on sparse signal recovery (Candes et al., 2006), 
efficient algorithms are available both for projection on overcomplete rep- 
resentations (Efron et al., 2004; Beck and Teboulle, 2009) and dictionary 
learning (Engan et al., 2000; Aharon et al., 2006; Mairal et al., 2010). 

Despite the very profusion of research papers dealing with dictionary- 
based approaches and overcomplete representations in general, except some 
noticeable exceptions (Balan, 1999; Skretting and Engan, 2010), few results 
have been reported on how the constructed representations should be com- 
pared, and in a more theoretical standpoint, on which topological space 
these representations live. 

Thus, to qualitatively assess a specific dictionary learning algorithm, one 
has to indirectly evaluate it through a benchmark based on a task perfor- 
mance (Engan et al., 2000; Grosse et al., 2007; Mairal et al., 2009; Skretting 
and Engan, 2010). Meanwhile, one can find in the literature some hints 
for dictionaries comparison with the aim of learning assessment (Aldroubi, 
1995; Aharon et al., 2006; Vainsencher et al., 2011; Skretting and Engan, 
2011), but they fall short to define a true metric. A recurrent view to dictio- 
nary comparison is based on the cross-gramian matrix and is only defined 
on univariate dictionaries. In Lesage (2007), it is called a correlogram and 
measures, such as precision and recall, were defined. In Aldroubi (1995), dic- 
tionaries are compared through the analysis Gram operator induced norm. 
The work of Balan (1999) introduces a mapping from the frame sets to a 
continuous functional space. The constructed functions allow then for the 
definition of an equivalence class and a partial order. Nonetheless this ap- 
proach is not invariant of linear transformation between frame elements. 

In harmonic analysis, overcomplete dictionaries are called frames (Chris- 
tensen, 2003). Frames are tightly linked with the work of Gabor, as they 
were formally defined by Duffin and Schaeffer (1952) and later popularized 
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by Daubechies et al. (1986) to be widely studied as a theoretical framework 
for wavelets (Mallat, 1999). They have been used in numerous domains, 
for example signal processing, compression or sampling. They have also 
been actively investigated for solving packing problems (Conway et al., 1996; 
Strohmer and Heath Jr., 2003; Dhillon et al., 2008). In its original formu- 
lation by Tammes (1930), the packing problem in a compact metric space 
consists in finding the best way to embed points on the surface of a sphere 
such that each pair of point is separated by the largest distance as possible. 
When considering the packing of subspaces, which has applications in wire- 
less communications, it has been the topic of a very active community, see 
for example (Casazza and Kutyniok, 2004; Kutyniok et al., 2009; Strawn, 
2012). 

Of particular interest is the remarkable equivalence between the Equian- 
gular Tight Frames design and the Grassmannian packing problem (Strohmer 
and Heath Jr., 2003). This opens new exciting perspectives on exploiting 
the theoretical results on Grassmannian manifolds and bring them to the 
context of frame theory and inherently overcomplete dictionaries. 

Several metrics have been proposed in Grassmannian spaces, such as the 
chordal distance (Conway et al., 1996; Hamm and Lee, 2008), the projection- 
norm (Edelman et al., 1999) or the Binet-Cauchy distance (Wolf and Shashua, 
2003; Vishwanathan and Smola, 2004). These distances have been used to 
define a reproducing kernel Hilbert space, for SVM-based approaches or to 
find the best packing with a given subspace, and were also used in image 
processing, see Lui (2012) for a review. 

The Grassmannian spaces define a natural framework to work with mul- 
ticomponent dataset. This opens new approaches to account for finding 
spatial relations in multivariate time-series or in multichannel images. Pre- 
vious works on dictionary learning in a multivariate context are applied on 
multimodal audio-visual data (Monaci et al., 2007, 2009), electro-cardiogram 
signals (Mailhe et al., 2009), color images (Mairal et al., 2008), hyperspectral 
images (Moudden et al., 2009), stereo images (Tosic and Frossard, 2011b), 
2D spatial trajectories (Barthelemy et al., 2012) and electro-encephalogram 
(EEG) signals (Barthelemy et al., 2013a). 

Main contribution of the paper is to introduce metrics suited to mul- 
tivariate dictionaries, which are invariant to linear combinations (see Sec- 
tion 5). To define these metrics, intermediate results were necessary. In par- 
ticular, transport metrics are defined over Grassmannian manifolds. A prin- 
cipled connection has been made between multivariate dictionaries, frame 
theory and Grassmannian manifolds by exploiting the connection between 
the frame design problem and the Grassmannian packing (Section 4). 

In the sequel, we first begin by recalling some definitions about Grass- 
mannian manifolds and their metrics in Section 2, supplementary material 
on matrix manifolds are given in Appendix A. In Section 3, necessary ma- 
terial from frame theory is recalled. An original formalized connection be- 
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tween Grassmannian frames, compressed sensing and dictionary learning is 
detailed in Section 4. In Section 5, distances between overcomplete repre- 
sentations are constructed as Wasserstein-Iike set metrics where ground met- 
rics are those defined on Grassmannian manifolds. In Section 6, dictionary 
learning principle is detailed for multivariate data, with different models. 
Assessment on synthetic data is presented in Section 7 with experimental 
results for several multivariate dictionary learning algorithms, that is over- 
complete representations on multichannel dataset. Proposed metrics allow 
to estimate the empirical convergence of dictionary learning algorithms and 
thus to evaluate the contribution of these different algorithms. Experiences 
on real dataset are shown in Section 8, exploiting the overcomplete repre- 
sentations to assess the quality of training dataset through unsupervised 
analyses. A well known multivariate dataset is investigated, namely the 
Brain-Computer Interface Competition IV-2a. An evaluation of the subject 
variability and of the difference between the training and testing datasets is 
produced, hence offering novel insights on the dataset consistency. Section 9 
concludes this paper and points out some future research directions. 

2 Grassmannian Manifolds and their Metrics 

This section recalls some definitions from algebraic geometry that will be 
helpful to characterize the metrics associated with Grassmannian manifolds. 
Some basic definitions, including the Grassmannian manifold one, is given 
in Appendix A. 

2.1 Preliminaries 

We consider here an n-dimensional vector space V with inner product (•, •) 
defined over the field M. We will denote by u, w the elements (vectors) of V 
and hy U, W the matrices of dimension n x p defined as U = [ui, . . . ,Up], 
W = [wi, . . . , Wp]. The transpose of a vector u (or a matrix U) is denoted u'^ 
{U'^). The inner product induces the I2 norm ||u||2= {u,u). The Frobenius 
norm is defined as ||C/|||'= tra.ce{U'^U) and its associated inner product is 
defined as {U,W)f = trace(W'^ U) . The subspace spanned by U is denoted 
as U and its dimension (rank of U) as g, thus dim(W) = rankiU) = g. 

2.2 Principal Angles 

A central notion to characterize the distance between two subspaces is the 
principal angles, which is easy to interpret as shown on Figure 1. 

Definition 1 The principal angles ^ 9i ^ . . . ^ 6g ^ ^ between two 
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Figure 1: Illustration of the principal angles between two subspaces lA and 
W. In this example, p = q = 2, 9i is null and 62 is plotted in the figure. 

subspaces U and W are defined recursively by 
cos Ok = max max yT Zk 

S-t. \\yk\\2= 1, \\Zk\\2= 1, 

and ylyj = 0, zlzj =0, j = 0, . . . , /c - 1 . 

It should be noticed that principal angles are also defined when the dimen- 
sion of the two subspaces are different, for example r = dimW ^ dimlA = 
Q ^ 1, in which case the maximum number of principal angles is equal to 
the dimension of the smallest subspace. 

A more computationally tractable definition of the principal angles relies 
on the singular value decomposition of two bases A = {ai}i^i spanning U 
and B = spanning W, with / and J two indexing sets: 

A^S = ySZ^ = y(cos0)Z^ , 

where F = [yi , . . . , y^] , Z = [zi , . . . , z^,] . We denote by 6 the f?- vector 
formed by the principal angles 9^, k = 1, . . . , g. Here cos is the diagonal 
matrix formed by cos^i, . . . , cosOg along the diagonal. The cos 6 is also 
known as the principal correlations or the canonical correlations (Hotelling, 
1936; Golub and Zha, 1995). 

2.3 Metrics on Grassmannian Manifolds 

In the following, we will review some of the most known Grassmannian 
metrics. Even if similar reviews could be found in Edelman et al. (1999) or 
in Dhillon et al. (2008), the one presented here is more detailed and more 
complete. Most of the Grassmannian metrics are relying on principal angles. 

A metric, defined for an arbitrary non empty set X, is a function d : 
X X X M"'" = [0, 00) which verify these three axioms: 
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(Al) Separability d{x, y) = iff x = y, Vx, y £ X , 
(A2) Symmetry d{x,y) = d{y,x), Vx,?/ € X , 

(A3) Triangle inequality d{x,y) ^ d(x,z) +d(z,y), for all x,y,z G X . 

Let Gr{g,n) be a Grassmannian manifold, as defined in Appendix A. 
Let U, W be two elements of Gr(g, n) and let {6i, . . . , 9g} be their associated 
principal angles. 

The metrics on Grassmannian manifolds rely on principal angles and 
a general remark applying to all metrics based on principal angles is that 
they are significant if and only if U and W are defined in the same space 
V. Furthermore, if U and W share basis vectors, an important proportion 
of the principal angles could be null. Assuming that U and W, such that 
span(C/) = l/( and span(VF) = W, forms separated bases the maximum 
number of non-null angles is min(^, n — g). 

1. Geodesic distance. A first metric to consider is the geodesic distance 
or arc length first introduced by Wong (1967). 

Definition 2 Let lA and W he two subspaces and 9 = [^i, • • • ,6^] be the g 
principal angles between lA and W, where g is the dimension of the smallest 
subspace. The geodesic distance between U and W is: 



Nonetheless this distance is not differentiable everywhere (Conway et al.. 



2. Chordal distance. Probably the most known Grassmannian met- 
ric (Golub and van Loan, 1996; Conway et al., 1996; Wang et al., 2006). 

Definition 3 Let U and W he two subspaces and = [9i, - ■ ■ ,6g\ he the g 
principal angles vector between U and W, where g is the dimension of the 
smallest subspace. The chordal distance is defined as: 



\k=l / 

Proposition 4 (Dhillon et al., 2008) The chordal distance can be refor- 
mulated in a more computationally effective way as: 



where the columns of If and W_ are the orthonormal bases for lA and W, 
that is IJ_ '^IJ_ = Ig and span (U) =U. 




(1) 



1996). It takes values between zero and 
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The chordal distance approximates the geodesic distance when the planes 
are close and its square is differentiable everywhere. 

The chordal distance could be seen as the embedding of an element 
U of Gr(^3, n) with its projection matrix C/C/^.The embedding could be 
formalized as: 

He : Gr(^,n) ^ M"^" 

Thus the Grassmannian element is embedded in the set of orthogonal pro- 
jection matrices of rank q. The chordal distance could thus be expressed 
as: 



V2 



\\uu'^ -ww'^W 



Instead of using the Probenius norm on the embedded Grassmannian, it 
is possible to use the 2-norm. The chordal 2-norm distance writes as: 



dc2{U,W) = \\Ull- 



ww^h-- 



llsin 011 



(sin^^,) 



which fails to be a metric because of (Al), it is thus a pseudo- metric. 

A variation of the chordal metric is the projection metric, which starts 
with different assumptions but yield very similar results. The projection 
metric is the result of embedding Gr{Q,n) in the vector space M"^^: 

dr,rai(U,W)=\\UY -WZ\\f 

1 



^proi{U,W)=\\UY 

2 sin 



E 

\k=l 



sm 



The projection distance is very similar to the chordal one: it is the minimum 
distance between all the possible representations of the two subspaces U and 
W (Chikuse, 2003), which could be reformulated in term of the following 
minimization problem: 



min ll^Zi^i 
\\UY - WZ\\f 



WR 



■2\\F 



where 0{q) is the orthogonal group. 

As for the chordal distance, it is possible to rely on the 2-norm instead 
of the Frobenius norm for the projection metric, giving also the following 
pseudo-metric: 



dproj2(Z^,W) 



U_Y_ - 

e 



2 sin 



2sin^ 
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3. Fubini-Study distance. The Fubini-Study distance is derived via the 
Pliicker embedding of Gr{Q, n) into the projective space P (A^ (M")) (taking 
the wedge product between ah elements of W), then taking the Fubini-Study 
metric (Kobayashi and Nomizu, 1969). 

Definition 5 Let lA and W he two subspaces and 9 = [^i, • • • ,6g] be the g 
principal angles vector between U and W, where g is the dimension of the 
smallest subspace. The Fubini-Study distance is defined as: 



In Conway et al. (1996), the authors also argued that the Pliicker em- 
bedding does not give a way to realize a geodesic or a chordal distance as 
an Euclidean distance. 

4. Spectral distance. Introduced in Dhillon et al. (2008), the spectral 
distance is used to promote specific configurations of subspaces in packing 
problems. 

Definition 6 Let lA and W he two subspaces and = [^i, • • • ,0g\ be the g 
principal angles vector between lA and W, where g is the dimension of the 
smallest subspace. The spectral distance is defined as: 



where the spectral norm ||X||2,2 returns the largest singular value of X . 

This distance allows to solve the packing problem with equi-isoclinic con- 
figurations, that is when all principal angles formed by all pairs of subspaces 
are equal. Nonetheless this distance, defined in Equation (4), is not a metric. 

5. Binet-Cauchy distance. A last metric on Gr{g,n) is the Binet- 
Cauchy distance which is the product of principal angles or canonical cor- 
relations (Wolf and Shashua, 2003; Vishwanathan and Smola, 2004). 

Definition 7 Let lA and W he two subspaces and = [^i, • • • ,0g\ be the g 
principal angles vector between lA and W, where g is the dimension of the 
smallest subspace. The Binet-Cauchy distance is defined as: 





The Fubini-Study distance can be computed efficiently as: 
di^{U,W) = arccos(det(C/^li:)) . 



ds(Z//,W) = minsin^fc 



(4) 






k=l 
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Distance 


Definition 


Metric 


DE 


Geodesic 


d^{u,w) =\\e\\2 


/ 


X 


Chordal 


dc{U,W) =||sin0||2 


/ 


/ 


Fubini-Study 


dis{U,W) = arccos(det(I7^E/)) 


/ 


/ 


Spectral 


d,{U,W) = {l- ||C/^]£||y^ 


X 


X 


Binet-Cauchy 


dbc(Z^,W) = (l-rifccos^^fc)^ 


/ 


/ 



Table 1: Distances on Grassmannian manifolds lA and W, DE stands for 
differentiable everywhere. 

This distance was introduced to enhance the computational efficiency 
and the numerical stability of the Canonical Correlation Analysis (CCA) 
based kernel approaches. 

All these distances are summarized in the Table 1 with their definition. 

3 Frames 

We assume that the reader is familiar with frame theory and we restrict 
ourselves to introduce definitions and facts that are mandatory for further 
developments in the paper. Please refer to Christensen (2003) for a deeper 
study of the topic. 

We assume that V, together with its inner product and the induced 
norm, is a separable Hilbert space. The indexing set I is in N as we will 
consider, without loss of generality, only finite-dimensional frames. 

Definition 8 A family of elements U = {uj}jg/ is a frame for V if there 
exist < 6i ^ 62 ^ oo, such that for all w £ V: 

bi IHI'^El^^'^*)l'^^2 INI' . (6) 

iei 

If ||^tj|P= 1 for all i, it is a unit norm frame. A tight frame is a frame whose 
bounds are equal (61 = 62) and if a frame verifies 61 = 62 = !> it is a Parseval 
frame or a normalized tight frame. This inequality implies that the frame 
U is full rank (Mallat, 1999). 

Definition 9 Let {nj}jg/ be a frame for V. The Bessel map, or analysis 
operator, is a function T : V — )■ ^^(/) defined by: 

T{w) = {{w,Ui))i^i . 
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Definition 10 Let {ujjig/ be a frame for V. The pre-frame operator, or 
the synthesis operator, is a map T* : L'^{I) — s- V defined by: 



T*c = ^CiUi . 



For an orthonormal basis, these operators are invertible and are each 
other inverse. In the context of frames, it is usually not the case. Specifically, 
r* is always surjective but usually not injective and T is always injective 
but usually not surjective. 

Definition 11 The frame operator is a mapping 5 : V — t- V and is defined 
as T*T , that is 



The frame operator is well defined, bounded, self-adjoint and positive. 

4 Sparse Coding, Dictionary Learning and Grass- 
mannian Frames 

Our attempt here is to establish a principled connexion between three close 
but still disjoint fields of research, namely Grassmannian packing, dictionary 
learning and compressed sensing. In this section, without loss of generality, 
we will restrict the discussion to the case of p = 1. 

4.1 Packing Problem and Grassmannian Frames 

One way to characterize a frame is to evaluate how "overcomplete" it is. 
The redundancy expresses this measure of overcompleteness and is defined 
for U = {ui}'^^ in a n-dimensional vector space V as ^. A more informative 
measure is the coherence (or maximum correlation) , defined as the maximum 
of the absolute inner product between any two elements of a unit norm frame 



The constraint on unit norm frame could be removed if the inner product 
of Equation (7) is divided by norm of the frame elements. 

A fundamental result in frame theory bounds for the coherence of any 
frame and the bound depends only on the number of elements in the frame 
and the dimension of the associated space V. 

Theorem 12 (Welch, 1974) Let U = {uj}™ be a frame for V, then its 
Welch bound is 




{Ui} 



m 




(7) 




(8) 
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The equality hold iff U is an Equiangular Tight Frame (ETF), \ {ui,Uj)\ = c 
for all i, j with i ^ j, for some constant c ^ 0. Furthermore, the equality 



In order to construct equiangular tight frames, a possible approach is 
to define a set of frames which minimizes the coherence for all unit norm 
frames with the same redundancy. 

Definition 13 (Strohmer and Heath Jr., 2003) A sequence of vectors 
U = {ui}'^i in V is called a Grassmannian frame if it is the solution to 
min/i where the minimum is taken over all unit norm frames in V. 

A unit norm frame holding the equality in Equation (8) is called an opti- 
mal Grassmannian frame. Note that optimal Grassmannian frames do not 
necessarily exist for all packing problem: a known example (Conway et al., 
1996) is the packing of 5 vectors in M'^, with g = I, m = 5, n = 3, whose 
coherence is at best ^ while its Welch bound is 

The ETF and the Grassmannian frames offer a well suited tool to solve 
the packing problem. In its general formulation, for given n, m and g, the 
goal of the packing problem is to find a set of f?- dimensional subspaces Ui in 
a n-dimensional vector space V such that (i) the minimal distance between 
any two of these subspaces are as large as possible, or equivalently (ii) the 
maximal correlation between subspaces is as small as possible. A schematic 
view of the packing of a space is shown on left part of Figure 2. 

In the case of Gr(l,n), investigated in Strohmer and Heath Jr. (2003), 
the correlation depends only on n and m. Thus subspaces are lines passing 
through the origin in V and the angles between any two of the m lines 
should be as large as possible. As maximizing the angle between lines is 
the same as minimizing the absolute value of the inner product of the Ui 
vectors, finding an optimal packing in Gr(l,n) is equivalent to finding a 
Grassmannian frame. 

4.2 Compressed Sensing vs Grassmannian Frames 

It is interesting to see that the ETF are also linked with sparse coding and 
compressed sensing. Very briefly, sparse representations and compressed 
sensing (Donoho, 2006) proposes to acquire a signal y in V by collecting 
m linear measurements of the form = {uk,y) + with 1 ^ A; ^ m or 
equivalently: 



where U is a n x m sensing matrix, with m <^ n and e is an error term. 
The compressed sensing theory states that, if y is reasonably sparse, it is 
possible to recover y by convex programming, as it is the solution to 



holds only if m ^ 




2 



a = U^y + e 



rnin ||y||i s.t. \\a — U y||2^ e • 



56V 



(9) 
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In a noiseless settings, with a null vector e, to achieve an exact reconstruction 
of y, a sensing matrix should satisfy the {K, (5)-Restricted Isometry Property 
(RIP). 

Definition 14 For a n x m sensing matrix U, the K-restricted isometry 
constant 6 of U is the smallest quantity such that 

{^-5) \\y\\l^\\u^y\\l^{^ + 5) Ml 

holds for every K-sparse vectors y. 

A matrix with a small 6 constant indicates that every subset of K or less 
columns is linearly independent. Furthermore, any U sensing matrix which 
is {2K, (5)-RIP for some 6 < \/2 — 1 allows to recover the signal using Equa- 
tion (9) with the following error bound 

V A 

Matrices with small 5 are abundant and could be easily obtained. Matrices 
with Gaussian or Bernoulli entries have small 6 with very high probability if 
the number of measurement m is of the order of K log(n/K). However, test- 
ing a matrix for RIP is computationally intensive, as it requires to determine 
singular values for n-hy-K submatrices. 

An alternative approach is to rely on ETF for designing deterministically 
RIP matrices. From the Theorem 2 of Mixon et al. (2011), the following 
lemma are defined: 

Lemma 15 (Mixon et al., 2011) Let U = {n,,}jg/ be a unit norm frame 
on V. The smallest Smm for which U is {K, 5) -RIP is defined by 

(5min = max \\U^Us - Ish , 

SC{l,...,n},\S\=K 

where Us is the submatrix consisting of columns of U indexed by S. 

As it is computationally demanding to compute all the eigenvalues of 
S X S submatrices, the Gershgorin circle theorem is used to obtain a coarse 
bound hnked to RIP (DeVore, 2007; Applebaum et al., 2009). 

Lemma 16 (Mixon et al., 2011) Let U = be a unit norm frame, 

then the K-restricted isometry constant 

where n is the coherence. 

Thus, U is (AT, (5)-RIP for 5 ^ {K — 1)^ and using Welch bound given 
in Equation (8). As ETF meet the Welch bound with equality, they are 

{K, 6) for 5^ ^ rn(n^i) ' Hcncc, ETF are good candidates for producing 
deterministically RIP sensing matrices. 
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4.3 Dictionary Learning vs Grassmannian Frames 

The dictionary learning aims at capturing most of the energy of a signal 
database Y = {yj}'j^i C V and representing it through a collection U = 
{ui}^^ thanks to a set of sparse coefficients A = {aj}'j^^. This collection, 
which is redundant since m ^ n, is called overcomplete dictionary. This 
dictionary is not necessarily full rank, which is a difference from frames as 
imposed by Equation (6). Formally, the dictionary learning problem writes 
as: 

min ||y — C/A|||. s.t. \\aj\\o ^ K, j = I, . . . ,q 

and \\ui\\2= I, i = I, ■ ■ ■ ,m , 

where HojUq is the number of nonzero elements of vector aj. This prob- 
lem is tackled by dictionary learning algorithms (DLAs), in which energy 
representative patterns of the dataset are iteratively selected by a sparse ap- 
proximation step, and then updated by a dictionary update step. Different 
algorithms deal with this problem: see for example the method of optimal 
directions proposed by Engan et al. (2000) and generalized under the name 
iterative least-squares DLA (Engan et al., 2007), the K-SVD (Aharon et al., 
2006) or the online DLA (Mairal et al., 2010). 

Data-driven dictionary learning extracts atoms that endow most of the 
energy of the dataset in the underlying norm sense. They are learned empir- 
ically to be the optimal dictionary that jointly gives sparse approximations 
for all of the signals of this set (Tosic and Frossard, 2011a), as illustrated 
on the center part of Figure 2. During learning process, no constraint is put 
on the dictionary elements in term of coherence. Whereas in the packing 
procedure for m elements, as shown in Section 4.1, amounts to come up with 
elements with maximal distance between any two of them, hence minimizing 
the maximal correlation. 

The packing procedure is data-free, in the sense that it is independent of 
the data and is a pure geometrical problem, while the dictionary learning is 
data-driven. An interesting view is to connect the two visions, an approach 
which is emerging in recent problem. A possible approach is to constraint 
the packing problem on the energy of the dataset, in other words a data- 
constraint packing. 

An alternative point of view, as developed by Yaghoobi et al. (2009), is 
to constraint the dictionary learning with a penalty term on the coherence 
of the dictionary elements (see the right part of Figure 2). This could be 
formalized as adding the penalty term during the update of dictionary, to 
constraint the dictionary element to be close to ETF, as: 

\\U^U-Gfp ^ e with G = 



1 9ij 
9ji 1 
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where maxj^j \gij\ is less or equal to the Welch bound given in Equation (8). 

As a summary, and as illustrated on Figure 2, packing spans the data 
space equi-correlatively (left), dictionary learning spans the data energy 
sparsely (center), and coherence constrained dictionary learning spans the 
data energy sparsely and equi-correlatively (right). 



Figure 2: In the frame identification problem or packing problem (left), the 
frame elements (circles) are used cover the entire space under the constraint 
that the distance between any two elements should be maximal, being as 
close as possible to an ETF. In the dictionary learning setup (center), the 
frame elements are chosen to minimize the reconstruction error on the en- 
ergy of the dataset (represented as gray values in "G" shape). An hybrid 
approach (right) adds coherence constraints in the dictionary learning frame- 
work. 



5 Metrics between multivariate dictionaries 

In this section, we will use the metrics described in Section 2.3 and sum- 
marized in Table 1 which act on elements of a Grassmannian as ground 
distance. This ground distance has a key role for the definition of a metric 
between sets of points in a Grassmannian space, that is a distance between 
two subspaces spanned by their associated disctionaries. 

5.1 Preliminaries 

In this section, the vector space V is of dimension n x p, its elements will 
be denoted as U,W S V. Their respective spans are span(C/) = U and 
span(VF) = W. The indexed families of elements will be denoted U = 
{Ui}i£i and W = {Wj}ji^j. Indexed families of subspaces will be denoted 
U = {Ui}iei and W = {W,},ej. 

The Grassmannian manifold Gr(Q,n), g ^ p, together with any of the 
distances d discussed in Section 2.3, defines a metric space (or pseudo-metric 
space depending on the properties of the underlying distance). We will 
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denote it in the sequel as (G,d), and when there is no confusion as G. The 
following theorem characterizes the properties of Grassmannian manifolds. 

Theorem 17 (Milnor and Stasheff, 1974) Gr{Q,n) is a HausdorjJ, com- 
pact, connected smooth manifold of dimension Q{n — g) . 

Thus, we know that G is complete and (separable) Hausdorff, and from the 
fact that any topological manifold is locally compact, then one can define a 
Borel measure on G. We will denote by vr such a measure on the Borel a- 
algebra of G. The support of tt is denoted Spt(7r) and is the smallest closed 
set on which it is concentrated, that is the minimal closed subset X C G 
such that 7r(G\X) = 0. The triplet (G, d, vr) is then a metric measure space. 
Furthermore, G is a Polish space, that is a separable completely metrizable 
topological space. 

5.2 Hausdorff Distance 

A classical approach to define a distance between subsets of a metric space 
is to rely on the Hausdorff distance. Within our context, frames are subsets 
of the space G, and the Hausdorff distance is a first natural way to define a 
distance in the frame space. 

We denote by Br(W) the r-neighborhood of a set W in the metric space 
(G, d), that is the set of points U such that m({d(U, W) : W G W} < r. 

Definition 18 Let U = {Ui}i,zj and W = {Wj}jgj be two subsets of the 
metric space (G, d) . The Hausdorff distance between U and W, denoted by 
dH(^i^), is defined as 

dH{U,W) := inf{r > : U C ^^(W) and W C Br{U)} , 

or equivalently as: 

dH(V,W) := max (sup inf d(U,W), sup mf d(U,W)] . 

Proposition 19 (Burago et al., 2001) Consider the metric space G, 
then: 

1. dn is a semi-metric on 2^ (the set of all subsets ofG), 

2. (i//(U, U) = for any U C G where U denotes the closure o/U , 

3. IfV and W are closed subsets of G and dH{U,W) = 0, then U = W. 

Then it turns that the Hausdorff distance is a true metric iff the con- 
sidered sets are closed. A known limitation is that the Hausdorff distance is 
non-smooth. 
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This distance could be reformulated in term of a correspondence. This 
allows to explicit the link between the Hausdorff distance and the class of 
Wasserstein distances that will be introduced in the sequel (cf. Section 5.3). 



Definition 20 Let U and W be two sets, a subset R C V x W is a corre- 
spondence if\/U eU there isW eW s.t. {U, W) e R and VW G W there is 
U s.t. {U,W) G R. 

Let 7^(U, W) denotes the set of all possible correspondences between U 
and W. It is possible to distinguish between a correspondence (p between U 
and W defined on the subset i? C U x W as 

R = {(U,W) G U X W : W G cl){U),U G (^(W)} 

and the subset R, which is also called the graph of (p. 

Proposition 21 Let U = {Ui}i^i and W = {Wjljgj be two subsets of 
the metric space (G,d). The Hausdorff distance between U andW could be 
rewritten as: 



(ii^(U,W) =inf{ sup d{U,W) : R e n{i],W)} , 



or equivalently 



(iH(U,W)= inf sup d{U,W) . (10) 



With the view of this last proposition, one can easily construct new set- 
metrics by replacing the sup operator in Equation (10) by any ip norm, 
hence leading to more interesting smooth set-metrics. Such metrics belong 
to the class of Wasserstein distances by restricting the Borel measure to the 
Dirac function. The next section clarifies this statement. 



5.3 Wasserstein Distance 

Let us define by C(G) any collection on G and by Gw = {{i^,'^u) : ^ G G} 
where ttu is a Borel measure. 

Definition 22 Let U, W G Gw- ^ measure vr on the product space U x W 
is a coupling of vru and vrw if 

^(U X W) = ^u(U), ^(U X W) = 7rw(W) (11) 

for all Borel sets U C U, W C W. 

We denote by 7W (vru, vrw) the set of all couplings of vru and vrw. An in- 
teresting property is that given vr G 7W(7ru,7rw) then -^(vr) = Spt(7r) is in 

7^(u,w). 
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Definition 23 Let U, W G Gw, and tt a coupling as defined in Equa- 
tion (11). Given p > 0, the Wasserstein distance is defined as: 

(i^(U,W)= inf (/ d{U,WydTT{U,W)]'' if I < p (12) 

7reA4(7rD,7rw) VJUxW / 

d^(U,W)= inf [ diU,WfdTr{U,W) if < p < 1 (13) 

7r6A^(7ru,7rw) JUxW 

d^(U,W)= inf sup d{U,W) otherwise. (14) 

7reA^(7rtj,7rw) (W,W)e7^(7r) 

This distance is also called the Wasserstein-Kantorovich-Rubinstein. 

Proposition 24 Let U, W be two subsets of G, if we consider as a measure 
TT the Dirac one, then the Wasserstein distance is related to the Hausdorff 
distance by the following inequality: 

di^(U,W) ^ d^((U,7ru),(W,7rw)) . (15) 

5.4 Set-Metrics for Frames and dictionaries 

Metrics in Equations (12), (13), (14) and (15) are defined on Grassmannian 
space. As indicated in Section 5.1, (G, d) is a separable metric space allowing 
to compute a distance between the collection of subspaces spanned by a 
frame and the collection of subspaces spanned by another. The distance 
between two frames in the frame space, that is V, is defined as follows. 

Definition 25 Let U = {C/j}je/ and W = {Wj^j^j be two frames of the 
n X p vector space V. We define a distance between these two frames as: 

dF{V,W) = ds{U,W) , (16) 

where V = {Ui : Ui = span(f7i),i G /} and W = {Wj : span{Wj),j G J}, 
and ds is either dn or dw- 

Prom this definition, we can formulate the proposition: 
Proposition 26 Let C(V) be any collection on V.Then the following holds: 

• dp is pseudo-metric and hence (C(V),dir) is pseudo-metric space, 

• dp is invariant by linear combinations. 

Proof The proof is a direct consequence of Equation (16) since dp is de- 
fined as a distance between subspaces. ■ 

The frame distance dp is defined as the distance between their subspaces, 
that is di? is acting in the frame space V whereas the distance ds is acting 
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in the Grassmannian space G. As an element Z// of G is a subspace, there 
exist an infinite number of elements f7 in V spanning li. Thus two distinct 
frames Ui / U2, that is two collections {C//}jg/ and {Uf}i,^i of elements 
in V, could span the same collection of subspaces U = {ly(i}i^j in G. In 
other words, a distance dp{\Ji,\J2) = ^^(U, U) = could exist for two 
separate frames Ui 7^ U2. As the separability axiom does not hold, dp is a 
pseudo-metric and the separability axiom is relaxed to the identity axiom: 
d{x, x) = 0, Vx S X. 

Pseudo-metric spaces are similar to metric spaces and important prop- 
erties of metric spaces could be proven as well for pseudo-metric spaces, so 
the fact that (C(V),di?) is a pseudo-metric space has a very limited impact 
for further theoretical developments. This situation is even desirable, as 
the distance between frames is defined through their associated subspaces 
yields the invariance by linear combinations, which is an essential property 
for comparing multivariate dictionaries. 

The proposition of set-metrics for frames is a major contribution for 
the dictionary learning community as it is the first proposition to define a 
real metric in the dictionary space. Previous works rely on heuristics to 
assess differences between dictionaries and no tractable framework has been 
proposed to compare dictionaries. The work of Skretting and Engan (2011) 
is a noticeable attempt to quantify the similarity between two dictionaries, 
without achieving to define a metric. 

The proposed pseudo-metrics - relying on Hausdorff and Wasserstein 
set-metrics, as well as chordal, Fubiny-Study or other ground metrics - have 
been described in the literature and are easy to implement. The following 
section described a multivariate dictionary learning algorithm which is then 
applied in Section 7 on synthetic data and in Section 8 on real world sig- 
nals, demonstrating that these metrics could be applied "out-of-the-box" to 
design better dictionary learning algorithms or to assess machine learning 
datasets. 

6 Dictionary Learning for Multivariate Signals 

Hereafter, to illustrate the metrics between frames with concrete examples, 
we will considered a multivariate dictionary U as a collection of m multi- 
variate atoms Ui G M"^''. In this section, models and dictionary learning 
algorithms for multicomponent signals are detailed. 

6.1 Model for Multivariate Dictionaries 

We consider a multivariate signal Y £ M*^^^, whose decomposition is carried 

out on the multivariate dictionary U thanks to the multivariate model (Barthelemy 
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et al., 2012): 



Y 



m 

^aiUi + E 

i=l 



(17) 



where Oj € M denotes the coding coefficient and E G R"''^^ the residual 
error. Two problems need to be handled on this model: the first one, called 
sparse approximation, estimates the coefficient vector a when the dictionary 
U is fixed, and the second one, called dictionary learning, estimates the best 
dictionary U from the dataset Y = {^ jj^^- 

Since the dictionary is redundant, m ^ n, the linear system of (17) 
is thus under-determined and has multiple solutions. The introduction of 
constraints such as sparsity allows to regularize the solution. With the sparse 
approximation, only K active atoms are selected among the m possible 
ones and the associated coefficients vector a is computed to maximize the 
approximation of the signal Y. The multivariate sparse approximation can 
be formalized as: 



mma 



Y 



i=l 



S.t. 



(18) 



where K <^m is a constant. But this problem is NP-hard (Davis, 1994). 
In the univariate case, non-convex pursuits tackle it sequentially, such as 
matching pursuit (MP) (Mallat and Zhang, 1993) or orthogonal matching 
pursuit (OMP) (Pati et al., 1993). Many other sparse approximation algo- 
rithms are reviewed in Tropp and Wright (2010). In the multivariate case, 
Equation (18) is solved by the multivariate OMP (M-OMP) (Barthelemy 
et al., 2012). 

Dictionary learning presented in Section 4.3 is now extended to the mul- 
tivariate case. Adding an index j to variables to denote their link to the 
signal Yj, the multivariate dictionary learning problem is formalized by: 



mmu mm^ 



i=l 



3.t. Ilajllg ^ K, j = l...( 



(19) 



and \\U, 



1, i = 1 . . .m . 



The multivariate DLA (M-DLA) proposed in Barthelemy et al. (2013a) 
solves this problem by learning a multivariate dictionary from the multi- 
variate dataset Y. This algorithm has been used to study multivariate 
time-series such as 2D spatial trajectories (Barthelemy et al., 2012) and 
EEG signals (Barthelemy et al., 2013a). 



6.2 Model for Multicomponent Dictionaries 

Some other works use multicomponent dictionaries to analyze multicompo- 
nent data: audio- visual data (Monaci et al., 2007, 2009), electro-cardiogram 
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signals (Mailhe et al., 2009), color images (Mairal et al., 2008), hyperspectral 
images (Moudden et al., 2009), stereo images (Tosic and Frossard, 2011b). 
But their decomposition model is different from the one used with multi- 
variate dictionaries. 

With multicomponent dictionaries, each component is associated with 
a univariate atom multiplied by its own coefficient, contrary to the multi- 
variate model of Equation (17) where components form a multivariate atom 
multiplied by one unique coefficient, identical for all components. Let re- 
call that Y = [ui, . . . ,yp] is a n x p matrix and Ui = . . . , Ui^p] is a 
n X p matrix. In this paragraph, the coefficients Oj = [aj,i, . . . form a 

/?-dimensional vector. Each component k is decomposed independently: 

m 

yk = '^ ai,k Ui^k + ek , k = I . . . p , (20) 
1=1 

thus the coefficients a is amx p matrix. To compare the multivariate model 
of Equation (17) with the multicomponent model of Equation (20), the latter 
can be reformulated as: 

m 

Y = ^ai(g)Ui + E , 

i=l 

where (8) is defined as the element-wise product along dimension p. Thus, 
in the multicomponent approach, multivariate signals are transformed in 
parallel univariate signals, causing an important information loss. 

Sparse approximations adapted to this model estimate the multicom- 
ponent coefficients associated to the selected active atoms. The relation 
between the components of an atom is only considered during the atom 
selection: the index i is the same for the p components Uj ^. The selected 
atom Ui is thus jointly chosen between components thanks to a sparse mixed 
norm (Gribonval et al., 2007; Rakotomamonjy, 2011). Algorithms adapted 
to multicomponent approach are not detailed more here since this model is 
not used in the sequel. 

6.3 Model for Rotation Invariance 

The multivariate model described in Equation (17) is now extended to in- 
clude rotation invariance. A rotation matrix i?j S M^^'' is added to each 
atom Ui G M"^^, with RiRf = Ip. Thus, the decomposition model becomes: 

m 

Y = Y,a^U,Ri + E . (21) 

i=l 

This model is called nD Rotation Invariant (nDRI) in Barthelemy et al. 
(2013b), since multivariate atoms Ui are now able to rotate. 
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The multivariate sparse approximation described in Equation (18) is ex- 
tended to this rotation invariant model. Coefficients a = {ai}^^ and rota- 
tion matrices R = {-Rijl^Li estimated with the nDRI-OMP (Barthelemy 
et al., 2013b) such as: 

2 



mma,fi 



i=l 



S.t. 



|a|lo ^ ^ 



and RiRi = Ip, i = 1 . . .m 



The core of the nDRI-OMP is the orthogonal Procrustes problem. The goal 
is to estimate the coefficient a E M and the rotation matrix R S M^^p which 
optimally register the atom U £ W^^p on the signal Z G M"^'': 



~ 1 1 2 T' 

[a, R) = arg miiia^R \\ Z — aU R\\p s.t. RR 



In 



The function which computes this step in denoted (a, R) = nD_Registration(?7, Z) 
in the following. 

The multivariate dictionary learning described in Equation (19) is ex- 
tended to the rotation invariant case as: 

2 



minu 



mm„^,K, 



Y, 



i=l 



Oji.j Ui Ri^j 



S.t. 



and RijRfj = Ip, i = 1 . . . 



m, J 



1..., 



and \\U 



1, i = 1 . . .m . 



The algorithm which solves this problem is called nDRI-DLA (Barthelemy 
et al., 2013b) and it has been applied in a trivariate case to study 3D spatial 
trajectories. 



7 Experiments on Synthetic Data 

This work is the first attempt to assess qualitatively or quantitatively the 
dictionary learning algorithms with real metrics. Metrics on frames have 
several immediate and useful applications to improve our understanding of 
learning algorithms and datasets in a dictionary-based framework. This 
section is devoted to show why relying on metrics allows to dramatically 
improve the assessment of dictionary learning algorithms. More precisely, a 
set of experiments is conducted in Section 7.3 to reproduce state-of-the-art 
results on synthetic dataset and shows how the different proposed metrics 
compared with the commonly used indicators. A second set of experiments, 
presented in Section 8, shows how metrics could be used on multivariate 
signal, in the context of Brain-Computer Interface, to evaluate qualitatively 
datasets of brain signal. 
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7.1 Experimental Protocol 



The experimental protocol described hereafter was inspired by Aharon (2006), 
Barthelemy et al. (2012) and Barthelemy et al. (2013b)^. It is presented leav- 
ing out the explanations relative to the shift-invariant signal processing to 
lighten the description. 

A dictionary with m multivariate atoms of p = 10 components, generated 
randomly, serve as a generating synthetic data and is thus called the original 
dictionary. By combining a small number of atoms of the original dictionary, 
a training dataset (Ygtraight) is produced. This training dataset is used to 
extract a learned dictionary with at least m atoms. The quality of the DLA 
is then evaluated by comparing how many atoms from the original dictionary 
have been recovered in the learned dictionary. 

Here, the original dictionary U of m = 135 normalized multivariate 
atoms is created from white uniform noise and the length of the atoms is 
n = 20 samples. The training set Yg^raight is composed of q = 2000 signals of 
length n and it is synthetically generated from dictionary U. Each training 
signal is generated as of the sum of three atoms, the coefficients and the atom 
indices being randomly drawn. At first, no noise is added to the generated 
signals for the set of experiments conducted in Section 7.3. An analysis of 
the noise sensitivity is proposed in Section 7.4. The dictionary initialization 
is made on the training set, and the learned dictionary U is returned after 
80 iterations over the training set. 

The experimental protocol was slightly changed to give the training set 
Yrotation- In this casc, a rotation is applied for each atoms composing the 
training signals. The coefficient of the rotation matrix, see Equation (21), 
are drawn randomly. 

7.2 Detection Rates and Metrics Calculus 

In the experiments, a learned atom Ui is considered as detected, or recovered, 
if its correlation value Vi with its corresponding original atom Ui is greater 
than a chosen threshold s. In the multivariate case of M-DLA described in 
Equation (17), it is expressed as: 



where s is a threshold fixed at 0.99 or 0.97. The detection rate is defined as 
the percentage of the original dictionary atoms which are recovered in the 
learned dictionary and it is denoted as to. 99 and to. 97- This threshold-based 
approach is very common in the DLA community to obtain a quantitative 
measure on dictionaries, but it is not a metric. Remark that this detection 

^In the context of rotation invariance, this experiment can be viewed as the extension 
of Barthelemy et al. (2013b) in dimension p — 10. 




(22) 
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rate is invariant to sign: C/j is considered recovered if it is close to the original 
atom Ui or to its opposite —Ui. The invariance to scale is already obtained 
since atoms are normalized. 

This is naturally extended to include rotation invariance, as shown in 
Equation (21) for nDRI-DLA setup, by computing the correlation between 
Ui and Ui after nD registration: 

Ui ^ s with (ui, •) = nD_Registration(C/j, Ui) . (23) 

Due to the nD registration step, this detection rate is invariant to rotation: 
Ui is considered recovered if it is a rotated version of the original atom Ui. 

In the following, two multivariate DLA algorithms are tested: M-DLA 
and the rotation invariant nDRI-DLA. The M-DLA is tested on the dataset 
Ystraight; which does not include rotation of original atoms. The nDRI- 
DLA is evaluated on the two datasets Ygtraight and Yrotation- The assess- 
ment of nDRI-DLA for both datasets relies on the detection rate based on 
Equation (23). The evaluation of the M-DLA detection rate is based on 
Equation (22). 

Thanks to the Equation (16), it is possible to define real set-metric be- 
tween dictionaries U and U. With set-metrics, one important choice is 
the ground metric. For the multivariate dictionaries presented here, any 
distance described in Section 2.3 could be selected as ground distance for 
Hausdorff or Wasserstein metric. The ground metric allows to compute the 
distance between a learned atom and its associated atom in the original 
dictionary. For now on, the notation is changed for the sake of clarity to 
indicate the set-metric as a subscript and the ground distance as a super- 
script, for example is the Wasserstein distance based on chordal ground 
distance. 

To emphasize the importance of the ground distance choice and to clarify 
the results, only two ground distances are considered in this section. The 
first ground metric is the chordal distance d^, defined in Equation (2), chosen 
for its invariance to linear combination, thus including rotations. As all 
principal angle-based ground metric share these properties, the chordal have 
been selected only because it is a widely accepted choice for Grassmannian 
manifold. 

The second ground metric is a simple Frobenius-based distance d^, ex- 
pressed for multivariate atoms Ui and Uj as: 

(df([/i, Uj)y =\\Uj - U\\l= 2 -2tmceiUjU) = 2 (l - (^U, , 

under the classical assumption that the dictionary atoms have unit norm 
in the Frobenius sense, || ||f = || ||f = 1- This distance is a metric in 
Euclidean space but is not in Grassmannian spaces, it is thus not described in 
Section 2.3. The distance d^ is related to the detection rate of Equation (22), 
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Figure 3: Evolution of detection rates and dictionary distances for M-DLA 
applied on Ygtraight as a function of the learning iterations. 



but without the sign invariance: Uj is not considered recovered if it is close 
to —Ui. This distance is not invariant to linear transforms and hence to 
rotation and sign change. 

The distances d'^ and served as ground distance for Hausdorff and 
Wasserstein set-metrics. For the latter, it is parametrized with p = 1, see 
Equation (12), and also known as Earth Mover's distance or Kantorovich 
distance. Also, the measures are uniform on the whole support. We inves- 
tigate here the four possible combinations: Hausdorff over chordal (d^) or 
over Frobenius-based distance {d-^) and Wasserstein over chordal (d^) or 
Frobenius (dy^). The set-metric distance are rescaled to be visually compa- 
rable to detection rates which are given in percentage: 



7.3 Dictionary Recovering 

The M-DLA is evaluated on a dataset generated from a combination of the 
original atoms, that is dataset Ygtraight which does not include rotations. 
The results are presented on Figure 3, where the recovering rate of the 
learned dictionary, in term of distances or detection rates, is assessed for 
each iteration. 

The detection rates to. 99 or ^0.97 raises abruptly from to around 70% 
between iterations 5 and 20 and shows non-smooth variations around stable 
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Figure 4: Same as Figure 3 with nDRI-DLA on Ystraight^ it is a multivari- 
ate dictionary learning with rotation invariance (nDRI-DLA) applied on a 
dataset without rotation of the signals (Ygtraight)- 

value, 67% for io.99 and 72% for to.97) for the rest of the iterations. The 
Wasserstein set-metrics offer a better picture, this distance is more smooth 
and more stable that the detection rates. This metric captures the fact that 
as the learned dictionary atoms are initialized with training signals of the 
dataset, the distance between the learned and the original dictionaries is not 
null, which is not the case for the detection rates. When using the Wasser- 
stein set-metrics, the choice of ground distance has only a limited influence: 
the Frobenius-based metric stabilizes earlier than the chordal-based one. 
For the Hausdorff set-metrics, the experimental results confirm a known 
fact: the Hausdorff distances are less sensitive to internal variations inside 
the considered sets as they capture the distance of the most extreme point. 
Frobenius-based Hausdorff metric does not capture the changes caused by 
the learning algorithm, it stays with values around 40% for all iterations, 
whereas the chordal-based one shows a similar pattern to the Wasserstein 
distances but with a smaller amplitude. 

The results obtained with nDRI-DLA on dataset Ygtraight are shown 
on Figure 4. The set-metric ~ Wasserstein on chordal and Frobenius or 
Hausdorff on chordal ~ indicate that the algorithm slowly improve during the 
first 20 iterations before stabilizing. As observed on M-DLA, the Hausdorff 
distance based on chordal shows no sensitivity to the dictionary evolution 
during the learning phase. Another remark concerns the ground distance, 
here Frobenius or chordal, which has a major influence on the set-metric 
distance: the Wasserstein and Hausdorff distance based on chordal give very 
similar results. The detection rate are null for the whole learning phase and 
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Figure 5: Same as Figure 4 with nDRI-DLA on Yj-otation: it is a multivari- 
ate dictionary learning with rotation invariance (nDRI-DLA) applied on a 
dataset with rotation of the signals (Yrotation)- 



thus completely uninformative. 

The poor results of nDRI-DLA on Ystraight shown on Figure 4 are caused 
by the small size of the dataset given the large dimension of the multivariate 
components to learned: only q = 2000 signals for learning m = 135 atoms 
with p = 10 components. This is enough to achieve decent performances 
with M-DLA, but as nDRI-DLA should learn the rotation matrix on top 
of recovering the atoms and their coefficient, nDRI-DLA failed to obtain 
correct performance on dataset Ystraight- However, with dataset Yrotation 
each input signal contains three atoms from the original dictionary with 
different rotation coefficient, it is thus easier for nDRI-DLA to recover the 
atoms and learn the rotation matrices, as it is demonstrated on Figure 5. 

With dataset Yrotation > nDRI-DLA performances are shown on Figure 5. 
The detection rate with tg.gg stays at 0, whereas to. 97 abruptly increase from 
iteration 25 and displays large non-smooth fluctuations due to its binary 
nature. Moreover, for to. 99 very few atoms are considered as recovered, 
whereas for to. 97 (which remains a selective rate) around 75% of atoms are 
considered as recovered. This highlights the extreme sensitivity of the de- 
tection rate with respect to the threshold value s The Frobenius distance is 
affected by rotations and thus not suited for this setup: neither Wasserstein 
nor Hausdorff set-metric based on Frobenius distance are able to capture 
the evolution of the learned dictionary atoms. As explained previously the 
Frobenius- Wasserstein distance is not sensitive enough for the present setup. 
The chordal-based Wasserstein gives useful information on the convergence 
of the algorithm, which is smoothly increasing at each iteration. This ex- 
ample shows the importance of selecting an appropriate ground distance for 
set-metric, for example a rotation invariant distance such as the chordal is 
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Figure 6: Noise sensitivity analysis: averaged results for detection rates 
and distances with different noise level, that is 10, 20, 30 dB and no noise. 
(Top) M-DLA on dataset Ygtraight^ (Center) nDRI-DLA on Ygtraight (Bot- 
tom) nDRI-DLA on Yrotation- 

required to correctly assessed the nDRI-DLA with dataset Yj-otation- 
7.4 Noise Sensitivity Analysis 

For the next experiments, learning are repeated 10 times and Figure 6 shows 
the averaged detection rates and distances computed after 80 iterations. The 
previous experiments are replicated for different noise levels in the three 
setups: M-DLA on Ygtraighti nDRLDLA on Ygtraight cind nDRLDLA on 
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Yrotation- White Gaussiaii noise is added at several levels to obtain datasets 
with signal-to-noise ratio of 10, 20 and 30 dB, and a dataset is kept without 
adding noise. On each plot, to ease the comparison, the last column indi- 
cates the results obtained in the noiseless setting after 80 iterations, as in 
Figures 3, 4 and 5. The results are also averaged over 10 repetitions for the 
noiseless case. 

In the M-DLA case, on the left of Figure 6, the detection rates and the 
distance are not affected by noise, showing the robustness of the learning 
algorithm on multivariate signals. These conclusions could be extended to 
the nDRI-DLA case on dataset Ygtraight, at the center of Figure 6. The orig- 
inal dictionary is poorly recovered, less than half of the atoms are identified, 
but the results are not affected by the presence of noise, even at high level. 
The detection rates to. 97 and to. 99 stay at for all experiments and are thus 
completely inadequate for these evaluations. 

Applying noise on the examples of dataset Yrotation as the same effect 
on nDRI-DLA than applying random rotations on each original atoms: the 
atoms "pop out" from the noise, that is the atoms are the only stable signals 
existing in the dataset, all other parameters vary. It helps the algorithm 
to correctly reconstruct the original atoms, find their coefficient and their 
rotation matrix. This explains that the dictionary reconstructing in the 20 
and 30 dB cases have results as good as the noiseless case. With only 10 
dB, the noise yields small perturbations, corrupting the learned atoms and 
resulting in a decrease of recognition rate. 

These experiments on synthetic dataset are aimed to demonstrate that 
direct applications of the previously described metrics offer immediate and 
handy tools for assessing dictionary learning algorithms. To exploit the full 
potential of the proposed metrics, we have chosen to propose experiments 
on multivariate datasets, but all these results could be applied directly on 
univariate datasets. 

8 Experiments on EEG Signals 

This section demonstrates some possible applications implied by the defi- 
nition of metrics between dictionaries. We choose to present experiments 
on Brain-Computer Interface (BCI) because the spatio-temporal structure 
of the data offer, to some extend, an intuitive example of the application 
of multivariate set-metrics. Most of the BCI systems rely on electroen- 
cephalographic (EEG) measurements, which are electric variations caused 
by brain waves are recorded on the surface of the head with several elec- 
trodes. A dataset is thus an ensemble of timeseries, each of them represent- 
ing the electric activity measured at a precise spatial location. Due to signal 
propagation and to physical properties of the head, two neighboring spatial 
locations have highly correlated signals, as it can be seen on Figure 7. 
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Example ol EEG during motor imagery task 
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Figure 7: Example of an EEG during a 2 seconds motor imagery task. The 
electrical signals recorded from the 22 electrodes are plotted with a slight 
shift on the ordinate axis. Data from BCI Competition IV, set 2a, subject 
7. 

8.1 Experimental Protocol 

Dictionary learning algorithms handling multivariate signals are able to cap- 
ture the temporal as well as the spatial coherence of the EEG datasets 
(Barthelemy et al., 2013a). In this section, multivariate dictionaries are 
learned with M-DLA using the datasets of BCI Competition IV set 2a (Brun- 
ner et al., 2008; Tangermann et al., 2012). This is a motor imagery task, 
where 9 subjects have been instructed to imagine four classes of movements 
(left hand, right hand, tongue or feet) during two sessions conducted on 
different days. Each session consists of 288 trials (72 trials for each class) 
and a trial is the 3 seconds recording of p = 22 electrodes sampled at 250 
Hz. For each session, for each subject and for each task, the M-DLA learns 
a dictionary of five shiftable patterns (giving a five time redundant dictio- 
nary), with a sparsity parameter K = 1. More details about this learning 
can be found in Barthelemy et al. (2013a). 

In the following, two experiments are presented which rely on the set- 
metrics between dictionaries to investigate the properties of the BCI dataset 
thanks to clustering techniques. A part of the BCI community is organized 
and structured around the BCI Competition datasets, even if the individual 
variability of the subjects, both in term of inter-session evolution and in 
subject-specific characterization, is still largely unharnessed: between 20- 
30% of the BCI subjects obtain catastrophic results with the state of the 
art algorithms, a phenomenon called "BCI illiteracy" or "BCI-inefficiency" 
(Vidaurre and Blankertz, 2010; Hammer et al., 2012). Is it possible to detect 
subjects who are "BCI-inefficient"? Do the subjects in the different compe- 
tition datasets have the same variability? How important is the inter-session 
variability, knowing that a separate session is often used as an evaluation set 
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for competition? All these questions require an objective measurement and 
set-metrics applied on learned dictionaries offer such a solution. The present 
experiments are intended to demonstrate that the proposed set-metric are 
ready to be applied and that their immediate application offer some inter- 
esting insight on BCI dataset. 

The first experiment proposes to investigate the link between the clus- 
tering of subjects and their BCI performance, in order to gain a better 
understanding of the "BCI-inefficiency" phenomenon. The second exper- 
iment is oriented to investigate the subject-specific intersession variability 
and its influence on the dictionary learning. 

8.2 Relation between Subject Clusters and BCI-InefRciency 

For these experiments, a dictionary is learned with M-DLA for each class 
of each subject: each subject is associated with four dictionaries (left hand, 
right hand, tongue and feet) learned over the first session of the dataset 
(used as a training set for the BCI Competition). 

A distance matrix between subject is computed for each class c: for a 
given class all the distances between subject's dictionary are evaluated with 
the Wasserstein set-metric based on chordal distance. All the distances g^^ 
between subjects i and j are converted in Gaussian similarities s^^ with 

the following relation : = exp ^— (g'^j)^/2(T^^ with a = 1. For each class, 

clusters of subjects are gathered using affinity propagation (Frey and Dueck, 
2007). Afhnity propagation find the optimal number of clusters relying only 
on the similarities between subjects and the preference value of each subject, 
which the predisposition of each subject to become the exemplar of a cluster. 
Here, we apply a common approach where all subjects have an identical 
preference value equal to the median of C^. 

The obtained cluster for each class are then combined using the clus- 
ter ensembles approach defined by Strehl and Ghosh (2003). The results 
is a unique and global set of C clusters, C being a parameter, obtained 
by maximizing the shared mutual information of all classes. A thorough 
analysis shows that subjects are aggregated in 3 or 4 stable cluster ensem- 
bles, represented with different colors on the left side of Figure 8. On this 
Figure, the performance of each subject obtained with the state of the art 
algorithm (Ang et al., 2012) is shown on the ordinate-axis, the subjects are 
sorted according to their performance on the abscissa-axis. Using C = 3 
cluster ensembles, it appears that the subject with the highest performance 
is always alone in a cluster, as the subject with the worst performance. All 
the other subjects are gathered in the same cluster. With C = 4, 5 or 6, 
only the four cluster ensembles shown on Figure 8(Left) are found. 

To have a better visualization of the relation between cluster, a hierar- 
chical clustering is shown on the right side of Figure 8. It should be noticed 
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Classification scores and cluster ensembles of BCI subjects Hierarchical clustering based on HausdorfF distance 
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Figure 8: Left: Performance of the subjects from the BCI Competition IV- 
2a with state of the art algorithm from Ang et al. (2012), color indicates 
cluster ensemble obtained with consensus clustering on learned dictionaries. 
Right: Hierarchical clustering on the same dataset, using Hausdorff distance 
to determine linkage. 

that instead of applying a common linkage criteria, such as the single linkage 
or the complete linkage, we rely here on the Hausdorff distance to evaluate 
the distance between sets of subjects. The results are very similar to those 
obtained with complete linkage. On the resulting dendrogram, it is visible 
that the subjects #2, #3, #9, #8, #5 and ^^4 belong to the same clus- 
ter and that each of the subjects #1, #7 and #6 constitute three separate 
clusters. 

These experiments explicit new techniques to investigate the BCI inef- 
ficiency thanks to the set-metrics and the dictionary learning. In the BCI 
Competition IV dataset 2a it appears that most of the subjects seem to share 
a common profile except two extreme C3<SGS, Si BCI inefficient and a BCI ef- 
ficient ones. The possibilities offered by multivariate dictionary learning in 
conjunction with set-metric bring new opportunities to qualitatively assess 
datasets used in competitions and challenges. These approaches could help 
the community to propose more consistent and more complete benchmarks 
or evaluations. 

8.3 Temporal Variability of Subject Brain Waves 

In this section, the experiments aimed at investigating the temporal vari- 
ability of a BCI subject, in other words how different are our brain waves 
from day to day? Appreciating this variability is crucial as it is very com- 
mon to train a classifier with session "A" and to evaluate the same classifier 
on data acquired from session "B" , recorded several days after session "A" . 
This was the case for the BCI Competition IV dataset 2a. Intersession vari- 
ability is also at the heart of the calibration problem: a common practice 



31 



Billet- Caiitliy 

FubLiii-Study 

Chorda! 

Geodesic 

Ensemble 

Cbaiice level 



"2 2.5 3 3.5 4 4.o 5 5.5 6 6.5 7 7.5 8 8.5 9 
Number of clusters 

Figure 9: Consensus clustering on the learned dictionaries with different 
ground metrics. The proportion of clusters specific to one session is plotted 
as a function of the number of clusters. 

is to calibrate the electronic BCI system, including the classifiers, each day. 
Nonetheless a legitimate question when using a BCI system all day long is 
how often does it need to be calibrated? The intrinsic temporal evolution 
of brain waves is a long studied problem and yet no reference methodology 
has been found. The proposed experiments could help to properly design 
such methodology. 

Here, a dictionary is learned with M-DLA for each subject, each class 
and each session. Thus for each session, "A" and "B", there is 36 dictio- 
naries learned, four by subject. The parameters used for M-DLA are the 
same than the previous section. The distances between dictionaries are 
computed with Wasserstein set-metric based on different ground distances: 
geodesic, chordal, Fubini-Study and Binet-Cauchy, defined respectively in 
Equations (1), (2), (3) and (5). As explained before, the clustering on 
dictionaries of each subject for a given class is obtained with the affinity 
propagation. The clusters for each class are then combined in cluster en- 
sembles with the consensus algorithm of Strehl and Chosh (2003). There 
are cluster ensembles for each ground distance and we also compute a global 
cluster ensembles gathering all classes and all ground distances for a given 
subject. 

The 9 subjects are separated based on the time of their recordings, either 
in session "A" and "B", resulting in 18 recordings to cluster. A cluster could 
gather recordings from sessions "A" or "B" and is said to belong to the 
session which has the most important representation among the recordings. 
For example, if a cluster gathers 3 recordings, two from session "A" and 
one from "B", the cluster is said to belong to "A" . We propose to use as 
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Figure 10: The 9 subjects recorded during 2 different sessions (A and B). The 
spatial positions chosen according to Laplacian eigenmaps. The color indi- 
cates the 4 cluster ensembles obtained with consensus clustering on learned 
dictionaries. The thickness of the links is proportional to the distance be- 
tween learned dictionaries. 

measure of dissimilarity the proportion of recordings in the same cluster. A 
score of 1 indicates that all clusters contain either recordings from session 
"A" or, in the exclusive sense, from session "B" . The minimal score is and 
indicates that each cluster contains as much recordings from session "A" 
and "B". The chance level is at 0.5. 

The Figure 9 shows the proportion of cluster's recordings in the same 
session when the number C of cluster ensembles increases. The different 
metrics yield the similar results and the global cluster ensembles, computed 
from all ground distances, displays a high level of dissimilarity between the 
two sessions. Those results are stable with respect to the number of cluster 
ensembles C. 

The Figure 10 proposes a different visualization of the same phenomenon. 
The 9 recordings of session "A" (one per subject) and the 9 ones of session 
"B" are represented with spatial positions drawn according to Laplacian 
eigenmaps (Belkin and Niyogi, 2003). By applying an eigenvalue decompo- 
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sition on the adjacency graph build from the distance matrix, the Laplacian 
eigenmaps embed the recording in and thus offer an isometric represen- 
tation. A thorough experimental analysis with the different metrics and the 
embedding produced by Laplacian eigenmaps leads us to choose = 10 as 
the number of nearest neighbors picked by the Laplacian eigenmaps and dis- 
crete weights (i = oo). Those parameters, used on the Fubini-Study distance 
matrix in the Figure 10, have been chosen for their illustrative potential and 
different parameters yield qualitatively similar results. The color code indi- 
cates how the recordings are separated into 4 cluster ensembles. The size of 
each link between recordings is proportional to the Wasserstein distance. 

This figure shows that only two recordings (subjects #6 and ^7 from ses- 
sion "A") are associated with cluster ensembles containing mainly recordings 
from session "B" . On the one hand, it appears clearly that all the recordings 
from session "B" are clustered and are separated by small distances, with 
the notable exception of the subject ^7. On the other hand, recordings from 
session "A" are all very distant from recording of session "B" and belong all 
to the same cluster, except for subjects #6 and #7 which are part of clus- 
ters of session "B" . The subject #7 is the one with the highest recognition 
rate, as shown on Figure 8: a tentative explanation is that the recordings 
of subject 7^7 are very similar between sessions "A" and "B" whereas all 
others subjects recording differ from one session to the other. 

As a corollary, when using a measure reflecting the subject-dependence 
of the results, no clear results appear and the measure stays at the chance 
level. The measure used evaluates the dissimilarity between the subjects: 
each time the two sessions of a given subject are not in the same cluster, 
the indicator is decreased. 

This demonstrates that there is a clear variation between sessions and 
that the current dictionary learning algorithm employed in these experi- 
ments, M-DLA, is sensitive to these session-dependent variations. The vari- 
ations could be caused by the electronic system or could originate from the 
subject. A first conclusion is that a BCI system should be recalibrated at 
the beginning of each session to avoid a performance drop. The second 
conclusion concerns the dictionary learning algorithm: a training dataset 
including trials from several sessions could reinforce the robustness of the 
dictionary. 

The last experiment concerns the classes of the motor imagery task. This 
time the distances are computed between classes for each subject, instead 
of computing distances between subjects for each class as in the previous 
experiment. The M-DLA is applied with the same parameters, the affinity 
propagation is applied on the classes for each subject and each ground metric 
(Binet-Cauchy, Fubini-Study, chordal and geodesic). Then the clusters of 
all subjects and metrics are merged in cluster ensembles with the consensus 
algorithm. The resulting cluster ensembles are shown on Figure 11: the 
hands and tongue imagery are gathered in the same cluster. The feet are set 
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Figure 11: Consensus clustering on motor imagery classes: the cluster en- 
semble reflects the neurobiological organization of the brain. 

in a separate cluster. This clustering reflect the neurobiological organization 
of the brain: on the one hand the localization of the cortical areas devoted 
to hand and tongue movements are situated on the sides of the head, in the 
parietal areas. On the other hand, the surface of the motor cortex dedicated 
to the feet movements is situated on the interior side of the interhemispheric 
fissure, generating EEG components centered on the top of the head. 

These experiments demonstrate that, properly equipped with a metric, 
the dictionary learning offer an efficient and robust approach to assess the 
intrinsic properties of multivariate datasets. The formalization of a metric 
space applicable to frames open several direct applications, as shown in this 
section. 

9 Conclusion 

This contribution rely on advances from algebraic geometry and frame the- 
ory to define suited metrics for multivariate dictionaries. It is the first con- 
tribution to define such metric spaces for learned dictionaries. The distance 
between dictionaries is computed as the distance between their subspaces, 
yielding a pseudo-metric space which is invariant to linear transformations, 
an essential property to compare multivariate dictionaries. 

The interest of the described set-metric is shown through its direct ap- 
plication on two examples: a synthetic dataset and real dataset of EEG for 
Brain Computer Interfaces. The defined metrics allow to estimate empir- 
ically the convergence of a dictionary learning algorithm with a precision 
outperforming the classical measurements based on detection rates. Fur- 
thermore, they also allow to qualitatively assess a specific dataset as it is 
shown with the BCI experiments. 

In experiments on synthetic and real datasets, it is shown that the set- 
metrics could be adapted to the specific properties of the considered data, as 
adding invariance to rotations by choosing a suited ground distance. Indeed, 
the choice of the ground distance offers a large amount of possibilities to 
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capture some specific properties linked to the properties of the underlying 
data. Ground metrics such as the chordal or the geodesic distance are a 
good default choice. 

The chosen examples amount from multivariate data analysis, illustrat- 
ing that the proposed metrics could be applied straightly on these complex 
problems. Nonetheless, all the work described in this contribution is also 
valid for the univariate case and operates with every dictionary learning algo- 
rithms. But in numerous applications, multivariate datasets are transformed 
to be processed by univariate algorithms, causing an important information 
loss. Metrics for multivariate dictionaries bring a framework to develop new 
and improved multivariate dictionary learning algorithms. 

At least, an original connection is made between Grassmannian frames, 
compressed sensing and dictionary learning. This disjoint research topics try 
either to cover the entire space under consideration, as in packing problems, 
or to minimize the reconstruction error on the signal energy, as in dictionary 
learning, or to add coherence constraints on dictionary learning algorithms, 
as in hybrid approaches. 
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Appendix A. 



In this appendix, we explicit the definitions of manifolds, matrix manifolds 
and Grassmannian Manifold. 

Manifolds 

Definition 27 Let us consider an n- dimensional vector space V with inner 
product {■, ■) defined over the field M. A manifold is a, topological space M 
of class (differentiable up to fc*'^ degree of differentiation) together with 
an atlas A = {(f)i,Zi)i^i on V, with I G N an indexing set. An atlas is a 
collection of charts, a chart {(pi,Zi) being a,n homeomorphism of the open 
subset Zi d M. onto an open subset (piZi) c V, that is {(pi, Zi) is a bijection, 
is continuous and its inverse 4>~^ is continuous. 

A manifold is differentiable if it endows a globally defined differentiable 
structure. To induce a globally differentiable structure from the charts, 
the composition of intersecting charts must be a differentiable function and 
the charts are said C'^-compatible. For example if two charts overlap, the 
coordinates defined by a chart are required to be differentiable with respect 
to the coordinates defined in the other chart. Formally, any two charts 
{(f>i,Zi), {(f)j,Zj) are C'^-compatible if 

(pij := {(pi o (pJ^\4>{ZinZj) ■ (pji^i n Zj) (pi{Zi n Zj)} , 

and its inverse <p)ji are of class C*^. U M. is C'^ for all k, M. is said smooth 
or C°°-manifold. 

Matrix Manifolds 

First, remark that if we let {cj}"^^ be a basis of V, then the function 

(p:V ^ M^jU [ui, ...,Un] 

such that u = UjCj is a chart of the set V. As all the charts constructed 
in this way are compatible (or "smooth"), they form an atlas of the set V, 
endowing V with a manifold structure. Thus, every vector space is a linear 
manifold. 

The set M"^^ with the usual sum and multiplication by a scalar is a 
vector space. Hence, it is endowed with a linear manifold structure and a 
chart of this manifold is 

(P:Wxe^ [/ ^ vec(?7) , 

where vcc(?7) denotes the vcctorization of the matrix U. The set R"^^ 
together with its linear manifold structure is denoted M hereafter. The 
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Figure 12: Hierarchy of matrix manifolds 

manifold M is the set of all matrices in M"^^, that is the set of all linear 
applications C : — or equivalently the set defined as the 
tensor product of and M". Also, M can be seen as an Euclidean space 
with the inner product {U,W) := vec(C/)"^ vec(VF) = trace{U^W) and the 
induced norm ||C/||f- 

Definition 28 Let n and g a positive integer with g ^ n. The nonorthog- 
onal Stiefel manifold St{g,n) is the set of all full rank matrices in M, that 
is 

St{g,n) := {[/ G M : rank([/) = g} . 

The column space U of a matrix U in St(g,n) defines the basis of a 
^-dimensional subspace in M", that is the g column vectors {ui, . . . ,Ug) of 
U span the subspace U, thus span(C/) = U. For n = g, the Stiefel manifold 
St(^, g) reduces to the general linear group G'L{g). 

Stiefel manifold defined here is different from the orthogonal Stiefel man- 
ifold St*, which is the set of orthonormal n-hj-g matrices. 

Definition 29 Let n and g be two positive integers such that g ^ n. The 
orthogonal Stiefel manifold St* {g,n) is defined as 

St* {g,n) := {[/ G M : U^U = /J , 

with Lg the g-hy-g identity matrix. 

For g = 1, St*(l, n) reduces to the unit sphere S"^""^ and for g = n^ii reduces 
to the set of orthogonal matrices 0(n), as shown on Figure 12. 
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Figure 13: Construction of the Grassmannian manifold. Here, the projection 
TT verifies Tr~^{TT{U)) = {UG'L{q) : U G GL(^))} and u is the projection of u 
on Gr{Q, n) such that u = uo tt. 

Grassmannian Manifold 

The Grassmannian manifold Gr{Q,n) is a quotient manifold, such as an 
element U of Gr(^, n) can also be caracterized as a £»-dimensional subspace 
in R". A matrix U is said to span U if span(C/) = and the span of U is 
an element of Gr(^, n) only if [/ G St{g,n). There is an infinite number of 
matrices in St(^, n) which span an element U of Gr{g,n), as illustrated on 
Figure 13. 

Definition 30 Given a matrix U € St{g,n), the set of the matrices that 
have the same span as U is defined as 

UGL{g) := {UA : A G GL(^)} , 

where GJj{g) is the set of g-by-g invertihle matrices. The Grassmannian 
manifold Gr(f), n) is defined as the orbit space of 

St(£»,n)/GL(^) := {UG\.{g) : U G St(^,n)} , 

with the group action St{g,n) x GL(^) — )• St{g,n), {U,A) ^UA. 

In the case n = 2 and g = 1, an element of a Grassmannian manifold 
defines an equivalence line in the plane, that is a set of points in St{g,n) 
which are invariant under the group GL(g). In a more general case, each 
element lA a. GT{g, n) defines a subspace and for each lA corresponds an 
equivalent class of matrices that span U. 
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